Using Locality and Interleaving Information to Improve Shared Cache Performance
نویسنده
چکیده
Title of dissertation: Using Locality and Interleaving Information to Improve Shared Cache Performance Wanli Liu, Doctor of Philosophy, 2009 Dissertation directed by: Professor Donald Yeung Department of Electrical and Computer Engineering The cache interference is found to play a critical role in optimizing cache allocation among concurrent threads for shared cache. Conventional LRU policy usually works well for low interference workloads, while high cache interference among threads demands explicit allocation regulation, such as cache partitioning. Cache interference is shown to be tied to inter-thread memory reference interleaving granularity: high interference is caused by fine-grain interleaving while low interference is caused coarse-grain interleaving. Profiling of real multi-program workloads shows that cache set mapping and temporal phase result in the variation of interleaving granularity. When memory references from different threads map to disjoint cache sets, or they occur in distinct time windows, they tend to cause little interference due to coarse-grain interleaving. The interleaving granularity measured by runlength in workloads is found to correlate with the preference of cache management policy: fine-grain interleaving workloads perform better with cache partitioning, and coarse-grain interleaving workloads perform better with LRU. Most existing shared cache management techniques are based on working set locality analysis. This dissertation studies the shared cache performance by taking both locality and interleaving information into consideration. Oracle algorithm which provides theoretical best performance is investigated to provide insight into how to design a better practical policy. Profiling and analysis of Oracle algorithm lead to the proposal of probabilistic replacement (PR), a novel cache allocation policy. With aggressor threads information learned on-line, PR evicts the bad locality blocks of aggressor threads probabilistically while preserving good locality blocks of non-aggressor threads. PR is shown to be able to adapt to the different interleaving granularities in different sets over time. Its flexibility in tuning eviction probability also improves fairness among thread performance. Evaluation indicates that PR outperforms LRU, UCP, and ideal cache partitioning at moderate hardware cost. For single program cache management, this dissertation also proposes a novel technique: reuse distance last touch predictor (RD-LTP). RD-LTP is able to capture reuse distance information, which represents the intrinsic memory reference pattern. Based on this improved LT predictor, an MRU LT eviction policy is developed to select the right victim at the presence of incorrect LT prediction. In addition to LT predictor, another predictor: reuse distance predictors (RDPs) is proposed, which is able to predict actual reuse distance values. Compared to various existing cache management techniques, these two novel predictors deliver higher cache performance with higher prediction coverage and accuracy at moderate hardware cost. Using Locality and Interleaving Information to Improve Shared Cache Performance
منابع مشابه
A Composable Model for Analyzing Locality of Multi-threaded Programs
In a multi-threaded execution, threads may negatively interfere when their private data contends for shared cache or positively interact when the data brought in by one thread is used by other threads. This paper presents a model of such cache behavior to predict locality without exhaustive simulation and provide insight into trends. The new model extends prior work that assumes no data sharing...
متن کاملDistance Analysis for Large - Scale Chip Multiprocessors
Title of dissertation: REUSE DISTANCE ANALYSIS FOR LARGE-SCALE CHIP MULTIPROCESSORS Meng-Ju Wu, Doctor of Philosophy, 2012 Dissertation directed by: Professor Donald Yeung Department of Electrical and Computer Engineering Multicore Reuse Distance (RD) analysis is a powerful tool that can potentially provide a parallel program’s detailed memory behavior. Concurrent Reuse Distance (CRD) and Priva...
متن کاملCode Placement using Temporal Profile Information
Instruction cache performance is important to instruction fetch efficiency and overall processor performance. The layout of an executable has a substantial effect on the cache miss rate and the instruction working set size during execution. This means that the performance of an executable can be improved significantly by applying a code-placement algorithm that minimizes instruction cache confl...
متن کاملThe Best Distribution for a Parallel OpenGL 3D Engine with Texture Caches
The quality of a real-time high end virtual reality system depends on its ability to draw millions of textured triangles in 1/60s. The idea of using commodity PC 3D accelerators to build a parallel machine instead of custom ASICs seems more and more attractive as such chips are getting faster. If image parallelism is used, designers have the choice between two distributions: line interleaving a...
متن کاملA Generic Tool Supporting Cache Design and Optimisation on Shared Memory Systems
For multi-core architectures, improving the cache performance is crucial for the overall system performance. In contrast to the common approach to design caches with the best trade-off between performance and costs, this work favours an application specific cache design. Therefore, an analysis tool capable of exhibiting the reason of cache misses has been developed. The results of the analysis ...
متن کامل